{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Introduction to the Amazon SageMaker IP Insights Algorithm\n",
    "#### Unsupervised anomaly detection for susicipous IP addresses\n",
    "-------\n",
    "1. [Introduction](#Introduction)\n",
    "2. [Setup](#Setup)\n",
    "3. [Training](#Training)\n",
    "4. [Inference](#Inference)\n",
    "5. [Epilogue](#Epilogue)\n",
    "\n",
    "## Introduction\n",
    "-------\n",
    "\n",
    "The Amazon SageMaker IP Insights algorithm uses statistical modeling and neural networks to capture associations between online resources (such as account IDs or hostnames) and IPv4 addresses. Under the hood, it learns vector representations for online resources and IP addresses. This essentially means that if the vector representing an IP address and an online resource are close together, then it is likey for that IP address to access that online resource, even if it has never accessed it before.\n",
    "\n",
    "In this notebook, we use the Amazon SageMaker IP Insights algorithm to train a model on synthetic data. We then use this model to perform inference on the data and show how to discover anomalies. After running this notebook, you should be able to:\n",
    "\n",
    "- obtain, transform, and store data for use in Amazon SageMaker,\n",
    "- create an AWS SageMaker training job to produce an IP Insights model,\n",
    "- use the model to perform inference with an Amazon SageMaker endpoint.\n",
    "\n",
    "If you would like to know more, please check out the [SageMaker IP Inisghts Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html). \n",
    "\n",
    "## Setup\n",
    "------\n",
    "*This notebook was created and tested on a ml.m4.xlarge notebook instance.*\n",
    "\n",
    "Our first step is to setup our AWS credentials so that AWS SageMaker can store and access training data and model artifacts.\n",
    "\n",
    "### Select Amazon S3 Bucket\n",
    "We first need to specify the locations where we will store our training data and trained model artifacts. ***This is the only cell of this notebook that you will need to edit.*** In particular, we need the following data:\n",
    "\n",
    "- `bucket` - An S3 bucket accessible by this account.\n",
    "- `prefix` - The location in the bucket where this notebook's input and output data will be stored. (The default value is sufficient.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import botocore\n",
    "import os\n",
    "import sagemaker\n",
    "\n",
    "\n",
    "bucket = ''   # <--- specify a bucket you have access to\n",
    "prefix = 'sagemaker/ipinsights-tutorial'\n",
    "execution_role = sagemaker.get_execution_role()\n",
    "\n",
    "\n",
    "# check if the bucket exists\n",
    "try:\n",
    "    boto3.Session().client('s3').head_bucket(Bucket=bucket)\n",
    "except botocore.exceptions.ParamValidationError as e:\n",
    "    print('Hey! You either forgot to specify your S3 bucket'\n",
    "          ' or you gave your bucket an invalid name!')\n",
    "except botocore.exceptions.ClientError as e:\n",
    "    if e.response['Error']['Code'] == '403':\n",
    "        print(\"Hey! You don't have permission to access the bucket, {}.\".format(bucket))\n",
    "    elif e.response['Error']['Code'] == '404':\n",
    "        print(\"Hey! Your bucket, {}, doesn't exist!\".format(bucket))\n",
    "    else:\n",
    "        raise\n",
    "else:\n",
    "    print('Training input/output will be stored in: s3://{}/{}'.format(bucket, prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "\n",
    "Apache Web Server (\"httpd\") is the most popular web server used on the internet. And luckily for us, it logs all requests processed by the server - by default. If a web page requires HTTP authentication, the Apache Web Server will log the IP address and authenticated user name for each requested resource. \n",
    "\n",
    "The [access logs](https://httpd.apache.org/docs/2.4/logs.html) are typically on the server under the file `/var/log/httpd/access_log`. From the example log output below, we see which IP addresses each user has connected with:\n",
    "\n",
    "```\n",
    "192.168.1.100 - user1 [15/Oct/2018:18:58:32 +0000] \"GET /login_success?userId=1 HTTP/1.1\" 200 476 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36\"\n",
    "192.168.1.102 - user2 [15/Oct/2018:18:58:35 +0000] \"GET /login_success?userId=2 HTTP/1.1\" 200 - \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36\"\n",
    "...\n",
    "```\n",
    "\n",
    "If we want to train an algorithm to detect suspicious activity, this dataset is ideal for SageMaker IP Insights.\n",
    "\n",
    "First, we determine the resource we want to be analyzing (such as a login page or access to a protected file). Then, we construct a dataset containing the history of all past user interactions with the resource. We extract out each 'access event' from the log and store the corresponding user name and IP address in a headerless CSV file with two columns. The first column will contain the user identifier string, and the second will contain the IPv4 address in decimal-dot notation. \n",
    "\n",
    "```\n",
    "user1, 192.168.1.100\n",
    "user2, 193.168.1.102\n",
    "...\n",
    "```\n",
    "\n",
    "As a side note, the dataset should include all access events. That means some `<user_name, ip_address>` pairs will be repeated. \n",
    "\n",
    "#### User Activity Simulation\n",
    "For this example, we are going to simulate our own web-traffic logs. We mock up a toy website example and simulate users logging into the website from mobile devices. \n",
    "\n",
    "The details of the simulation are explained in the script [here](./generate_data.py). \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install dependency\n",
    "!pip install tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from generate_data import generate_dataset\n",
    "\n",
    "# We simulate traffic for 10,000 users. This should yield about 3 million log lines (~700 MB). \n",
    "NUM_USERS = 10000\n",
    "log_file = 'ipinsights_web_traffic.log'\n",
    "generate_dataset(NUM_USERS, log_file)\n",
    "\n",
    "# Visualize a few log lines\n",
    "!head $log_file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the dataset\n",
    "Now that we have our logs, we need to transform them into a format that IP Insights can use. As we mentioned above, we need to:\n",
    "1. Choose the resource which we want to analyze users' history for\n",
    "2. Extract our users' usage history of IP addresses\n",
    "3. In addition, we want to separate our dataset into a training and test set. This will allow us to check for overfitting by evaluating our model on 'unseen' login events.\n",
    "\n",
    "For the rest of the notebook, we assume that the Apache Access Logs are in the Common Log Format as defined by the [Apache documentation](https://httpd.apache.org/docs/2.4/logs.html#accesslog). We start with reading the logs into a Pandas DataFrame for easy data exploration and pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\n",
    "    log_file,\n",
    "    sep=\" \",\n",
    "    na_values='-',\n",
    "    header=None,\n",
    "    names=['ip_address','rcf_id','user','timestamp','time_zone','request', 'status', 'size', 'referer', 'user_agent']\n",
    ")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We convert the log timestamp strings into Python datetimes so that we can sort and compare the data more easily. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert time stamps to DateTime objects\n",
    "df['timestamp'] = pd.to_datetime(df['timestamp'], format='[%d/%b/%Y:%H:%M:%S')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also verify the time zones of all of the time stamps. If the log contains more than one time zone, we would need to standardize the timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Check if they are all in the same timezone\n",
    "df['time_zone'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see above, there is only one value in the entire `time_zone` column. Therefore, all of the timestamps are in the same time zone, and we do not need to standardize them. We can skip the next cell and go to [1. Selecting a Resource](#1.-Select-Resource).\n",
    "\n",
    "If there is more than one time_zone in your dataset, then we parse the timezone offset and update the corresponding datetime object. \n",
    "\n",
    "**Note:** The next cell takes about 5-10 minutes to run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "import pytz\n",
    "\n",
    "\n",
    "def apply_timezone(row):\n",
    "    tz = row[1]\n",
    "    tz_offset = int(tz[:3]) * 60   # Hour offset\n",
    "    tz_offset += int(tz[3:5])      # Minuts offset\n",
    "    return row[0].replace(tzinfo=pytz.FixedOffset(tz_offset))\n",
    "\n",
    "\n",
    "df['timestamp'] = df[['timestamp','time_zone']].apply(apply_timezone, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Select Resource\n",
    "Our goal is to train an IP Insights algorithm to analyze the history of user logins such that we can predict how suspicious a login event is. \n",
    "\n",
    "In our simulated web server, the server logs a `GET` request to the `/login_success` page everytime a user successfully logs in. We filter our Apache logs for `GET` requests for `/login_success`. We also filter for requests that have a `status_code == 200`, to ensure that the page request was well formed. \n",
    "\n",
    "**Note:** every web server handles logins differently. For your dataset, determine which resource you will need to be analyzing to correctly frame this problem. Depending on your usecase, you may need to do more data exploration and preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[(df['request'].str.startswith('GET /login_success')) & (df['status'] == 200)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Extract Users and IP address\n",
    "Now that our DataFrame only includes log events for the resource we want to analyze, we extract the relevant fields to construct a IP Insights dataset.\n",
    "\n",
    "IP Insights takes in a headerless CSV file with two columns: an entity (username) ID string and the IPv4 address in decimal-dot notation. Fortunately, the Apache Web Server Access Logs output IP addresses and authentcated usernames in their own columns.\n",
    "\n",
    "**Note:** Each website handles user authentication differently. If the Access Log does not output an authenticated user, you could explore the website's query strings or work with your website developers on another solution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[['user', 'ip_address', 'timestamp']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Create training and test dataset\n",
    "As part of training a model, we want to evaluate how it generalizes to data it has never seen before.\n",
    "\n",
    "Typically, you create a test set by reserving a random percentage of your dataset and evaluating the model after training. However, for machine learning models that make future predictions on historical data, we want to use out-of-time testing. Instead of randomly sampling our dataset, we split our dataset into two contiguous time windows. The first window is the training set, and the second is the test set. \n",
    "\n",
    "We first look at the time range of our dataset to select a date to use as the partition between the training and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['timestamp'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have login events for 10 days. Let's take the first week (7 days) of data as training and then use the last 3 days for the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "time_partition = datetime(2018, 11, 11)\n",
    "\n",
    "# If you needed to normalize the timezones\n",
    "# time_partition = datetime(2018, 11, 11, tzinfo=pytz.FixedOffset(0))\n",
    "\n",
    "train_df = df[df['timestamp'] <= time_partition]\n",
    "test_df = df[df['timestamp'] > time_partition]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our training dataset, we shuffle it. \n",
    "\n",
    "Shuffling improves the model's performance since SageMaker IP Insights uses stochastic gradient descent. This ensures that login events for the same user are less likely to occur in the same mini batch. This allows the model to improve its performance in between predictions of the same user, which will improve training convergence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shuffle train data \n",
    "train_df = train_df.sample(frac=1)\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Store Data on S3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have simulated (or scraped) our datasets, we have to prepare and upload it to S3.\n",
    "\n",
    "We will be doing local inference, therefore we don't need to upload our test dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Output dataset as headerless CSV \n",
    "train_data = train_df.to_csv(index=False, header=False, columns=['user', 'ip_address'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Upload data to S3 key\n",
    "train_data_file = 'train.csv'\n",
    "key = os.path.join(prefix, 'train', train_data_file)\n",
    "s3_train_data = 's3://{}/{}'.format(bucket, key)\n",
    "\n",
    "print('Uploading data to: {}'.format(s3_train_data))\n",
    "boto3.resource('s3').Bucket(bucket).Object(key).put(Body=train_data)\n",
    "\n",
    "# Configure SageMaker IP Insights Input Channels\n",
    "input_data = {\n",
    "    'train': sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/csv')\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "---\n",
    "Once the data is preprocessed and available in the necessary format, the next step is to train our model on the data. There are number of parameters required by the SageMaker IP Insights algorithm to configure the model and define the computational environment in which training will take place. The first of these is to point to a container image which holds the algorithms training and hosting code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "\n",
    "image = get_image_uri(boto3.Session().region_name, 'ipinsights')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we need to determine the training cluster to use. The IP Insights algorithm supports both CPU and GPU training. We recommend using GPU machines as they will train faster. However, when the size of your dataset increases, it can become more economical to use multiple CPU machines running with distributed training. See [Recommended Instance Types](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html#ip-insights-instances) for more details. \n",
    "\n",
    "### Training Job Configuration\n",
    "- **train_instance_type**: the instance type to train on. We recommend `p3.2xlarge` for single GPU, `p3.8xlarge` for multi-GPU, and `m5.2xlarge` if using distributed training with CPU;\n",
    "- **train_instance_count**: the number of worker nodes in the training cluster.\n",
    "\n",
    "We need to also configure SageMaker IP Insights-specific hypeparameters:\n",
    "\n",
    "### Model Hyperparameters\n",
    "- **num_entity_vectors**: the total number of embeddings to train. We use an internal hashing mechanism to map the entity ID strings to an embedding index; therefore, using an embedding size larger than the total number of possible values helps reduce the number of hash collisions. We recommend this value to be 2x the total number of unique entites (i.e. user names) in your dataset;\n",
    "- **vector_dim**: the size of the entity and IP embedding vectors. The larger the value, the more information can be encoded using these representations but using too large vector representations may cause the model to overfit, especially for small training data sets;\n",
    "- **num_ip_encoder_layers**: the number of layers in the IP encoder network. The larger the number of layers, the higher the model capacity to capture patterns among IP addresses. However, large number of layers increases the chance of overfitting. `num_ip_encoder_layers=1` is a good value to start experimenting with;\n",
    "- **random_negative_sampling_rate**: the number of randomly generated negative samples to produce per 1 positive sample; `random_negative_sampling_rate=1` is a good value to start experimenting with;\n",
    "    - Random negative samples are produced by drawing each octet from a uniform distributed of [0, 255];\n",
    "- **shuffled_negative_sampling_rate**: the number of shuffled negative samples to produce per 1 positive sample; `shuffled_negative_sampling_rate=1` is a good value to start experimenting with;\n",
    "    - Shuffled negative samples are produced by shuffling the accounts within a batch;\n",
    "\n",
    "### Training Hyperparameters\n",
    "- **epochs**: the number of epochs to train. Increase this value if you continue to see the accuracy and cross entropy improving over the last few epochs;\n",
    "- **mini_batch_size**: how many examples in each mini_batch. A smaller number improves convergence with stochastic gradient descent. But a larger number is necessary if using shuffled_negative_sampling to avoid sampling a wrong account for a negative sample;\n",
    "- **learning_rate**: the learning rate for the Adam optimizer (try ranges in [0.001, 0.1]). Too large learning rate may cause the model to diverge since the training would be likely to overshoot minima. On the other hand, too small learning rate slows down the convergence;\n",
    "- **weight_decay**: L2 regularization coefficient. Regularization is required to prevent the model from overfitting the training data. Too large of a value will prevent the model from learning anything;\n",
    "\n",
    "For more details, see [Amazon SageMaker IP Insights (Hyperparameters)](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights-hyperparameters.html). Additionally, most of these hyperparameters can be found using SageMaker Automatic Model Tuning; see [Amazon SageMaker IP Insights (Model Tuning)](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights-tuning.html) for more details. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Set up the estimator with training job configuration\n",
    "ip_insights = sagemaker.estimator.Estimator(\n",
    "    image, \n",
    "    execution_role, \n",
    "    train_instance_count=1, \n",
    "    train_instance_type='ml.p3.2xlarge',\n",
    "    output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
    "    sagemaker_session=sagemaker.Session())\n",
    "\n",
    "# Configure algorithm-specific hyperparameters\n",
    "ip_insights.set_hyperparameters(\n",
    "    num_entity_vectors='20000',\n",
    "    random_negative_sampling_rate='5',\n",
    "    vector_dim='128', \n",
    "    mini_batch_size='1000',\n",
    "    epochs='5',\n",
    "    learning_rate='0.01',\n",
    ")\n",
    "\n",
    "# Start the training job (should take about ~1.5 minute / epoch to complete)  \n",
    "ip_insights.fit(input_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you see the message\n",
    "\n",
    "    > Completed - Training job completed\n",
    "\n",
    "at the bottom of the output logs then that means training successfully completed and the output of the SageMaker IP Insights model was stored in the specified output path. You can also view information about and the status of a training job using the AWS SageMaker console. Just click on the \"Jobs\" tab and select training job matching the training job name, below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Training job name: {}'.format(ip_insights.latest_training_job.job_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference\n",
    "-----\n",
    "\n",
    "Now that we have trained a SageMaker IP Insights model, we can deploy the model to an endpoint to start performing inference on data. In this case, that means providing it a `<user, IP address>` pair and predicting their compatability scores.\n",
    "\n",
    "We can create an inference endpoint using the SageMaker Python SDK `deploy()`function from the job we defined above. We specify the instance type where inference will be performed, as well as the initial number of instnaces to spin up. We recommend using the `ml.m5` instance as it provides the most memory at the lowest cost. Verify how large your model is in S3 and pick the instance type with the appropriate amount of memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictor = ip_insights.deploy(\n",
    "    initial_instance_count=1,\n",
    "    instance_type='ml.m5.xlarge'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Congratulations, you now have a SageMaker IP Insights inference endpoint! You could start integrating this endpoint with your production services to start querying incoming requests for abnormal behavior. \n",
    "\n",
    "You can confirm the endpoint configuration and status by navigating to the \"Endpoints\" tab in the AWS SageMaker console and selecting the endpoint matching the endpoint name below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Endpoint name: {}'.format(predictor.endpoint))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Serialization/Deserialization\n",
    "We can pass data in a variety of formats to our inference endpoint. In this example, we will pass CSV-formmated data. Other available formats are JSON-formated and JSON Lines-formatted. We make use of the SageMaker Python SDK utilities: `csv_serializer` and `json_deserializer` when configuring the inference endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.predictor import csv_serializer, json_deserializer\n",
    "\n",
    "predictor.content_type = 'text/csv'\n",
    "predictor.serializer = csv_serializer\n",
    "predictor.accept = 'application/json'\n",
    "predictor.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the predictor is configured, it is as easy as passing in a matrix of inference data.\n",
    "We can take a few samples from the simulated dataset above, so we can see what the output looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inference_data = train_df[:5].values\n",
    "predictor.predict(inference_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the predictor will only output the `dot_product` between the learned IP address and the online resource (in this case, the user ID). The dot product summarizes the compatibility between the IP address and online resource. The larger the value, the more the algorithm thinks the IP address is likely to be used by the user. This compatability score is sufficient for most applications, as we can define a threshold for what we constitute as an anomalous score.\n",
    "\n",
    "However, more advanced users may want to inspect the learned embeddings and use them in further applications. We can configure the predictor to provide the learned embeddings by specifing the `verbose=True` parameter to the Accept heading. You should see that each 'prediction' object contains three keys: `ip_embedding`, `entity_embedding`, and `dot_product`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "predictor.accept = 'application/json; verbose=True'\n",
    "predictor.predict(inference_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Anomaly Scores\n",
    "----\n",
    "The `dot_product` output of the model provides a good measure of how compatible an IP address and online resource are. However, the range of the dot_product is unbounded. This means to be able to consider an event as anomolous we need to define a threshold. Such that when we score an event, if the dot_product is above the threshold we can flag the behavior as anomolous.However, picking a threshold can be more of an art, and a good threshold depends on the specifics of your problem and dataset. \n",
    "\n",
    "In the following section, we show how to pick a simple threshold by comparing the score distributions between known normal and malicious traffic:\n",
    "1. We construct a test set of 'Normal' traffic;\n",
    "2. Inject 'Malicious' traffic into the dataset;\n",
    "3. Plot the distribution of dot_product scores for the model on 'Normal' trafic and the 'Malicious' traffic.\n",
    "3. Select a threshold value which separates the normal distribution from the malicious traffic threshold. This value is based on your false-positive tolerance.\n",
    "\n",
    "### 1. Construct 'Normal' Traffic Dataset\n",
    "\n",
    "We previously [created a test set](#3.-Create-training-and-test-dataset) from our simulated Apache access logs dataset. We use this test dataset as the 'Normal' traffic in the test case. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Inject Malicious Traffic\n",
    "If we had a dataset with enough real malicious activity, we would use that to determine a good threshold. Those are hard to come by. So instead, we simulate malicious web traffic that mimics a realistic attack scenario. \n",
    "\n",
    "We take a set of user accounts from the test set and randomly generate IP addresses. The users should not have used these IP addresses during training. This simulates an attacker logging in to a user account without knowledge of their IP history."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from generate_data import draw_ip\n",
    "\n",
    "# We only need the dot product. Let's reset the predictor output type.\n",
    "predictor.accept = 'application/json; verbose=False'\n",
    "\n",
    "\n",
    "def score_ip_insights(predictor, df):\n",
    "    \n",
    "    def get_score(result):\n",
    "        \"\"\"Return the negative to the dot product of the predictions from the model.\"\"\"\n",
    "        return [-prediction[\"dot_product\"] for prediction in result[\"predictions\"]]\n",
    "    \n",
    "    df = df[['user', 'ip_address']]\n",
    "    result = predictor.predict(df.values)\n",
    "    return get_score(result)\n",
    "\n",
    "\n",
    "def create_test_case(train_df, test_df, num_samples, attack_freq):\n",
    "    \"\"\"Creates a test case from provided train and test data frames. \n",
    "    \n",
    "    This generates test case for accounts that are both in training and testing data sets.\n",
    "\n",
    "    :param train_df: (panda.DataFrame with columns ['user', 'ip_address']) training DataFrame\n",
    "    :param test_df: (panda.DataFrame with columns ['user', 'ip_address']) testing DataFrame\n",
    "    :param num_samples: (int) number of test samples to use\n",
    "    :param attack_freq: (float) the ratio of negative_samples:positive_samples to generate for test case \n",
    "    :return: DataFrame with both good and bad traffic, with labels\n",
    "    \"\"\"\n",
    "    # Get all possible accounts. The IP Insights model can only make predictions on users it has seen in training\n",
    "    # Therefore, filter the test dataset for unseen accounts, as their results will not mean anything.\n",
    "    valid_accounts = set(train_df['user'])\n",
    "    valid_test_df = test_df[test_df['user'].isin(valid_accounts)]\n",
    "\n",
    "    good_traffic = valid_test_df.sample(num_samples, replace=False)\n",
    "    good_traffic = good_traffic[['user', 'ip_address']]\n",
    "    good_traffic['label'] = 0\n",
    "\n",
    "    # Generate malicious traffic\n",
    "    num_bad_traffic = int(num_samples * attack_freq)\n",
    "    bad_traffic_accounts = np.random.choice(list(valid_accounts), size=num_bad_traffic, replace=True) \n",
    "    bad_traffic_ips = [draw_ip() for i in range(num_bad_traffic)]\n",
    "    bad_traffic = pd.DataFrame({'user': bad_traffic_accounts, 'ip_address': bad_traffic_ips})\n",
    "    bad_traffic['label'] = 1\n",
    "    \n",
    "    # All traffic labels are: 0 for good traffic; 1 for bad traffic. \n",
    "    all_traffic = good_traffic.append(bad_traffic)\n",
    "\n",
    "    return all_traffic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NUM_SAMPLES = 100000\n",
    "test_case = create_test_case(train_df, test_df, num_samples=NUM_SAMPLES, attack_freq=1)\n",
    "test_case.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_case_scores = score_ip_insights(predictor, test_case)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Plot Distribution\n",
    "\n",
    "Now, we plot the distribution of scores. Looking at this distribution will inform us on where we can set a good threshold, based on our risk tolerance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "n, x = np.histogram(test_case_scores[:NUM_SAMPLES], bins=100, density=True)\n",
    "plt.plot(x[1:], n)\n",
    "\n",
    "n, x = np.histogram(test_case_scores[NUM_SAMPLES:], bins=100, density=True)\n",
    "plt.plot(x[1:], n)\n",
    "\n",
    "plt.legend([\"Normal\", \"Random IP\"])\n",
    "plt.xlabel(\"IP Insights Score\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "\n",
    "plt.figure()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Selecting a Good Threshold\n",
    "\n",
    "As we see in the figure above, there is a clear separation between normal traffic and random traffic. \n",
    "We could select a threshold depending on the application.\n",
    "\n",
    "- If we were working with low impact decisions, such as whether to ask for another factor or authentication during login, we could use a `threshold = 0.0`. This would result in catching more true-positives, at the cost of more false-positives. \n",
    "\n",
    "- If our decision system were more sensitive to false positives, we could choose a larger threshold, such as `threshold = 10.0`. That way if we were sending the flagged cases to manual investigation, we would have a higher confidence that the acitivty was suspicious. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "threshold = 0.0\n",
    "\n",
    "flagged_cases = test_case[np.array(test_case_scores) > threshold]\n",
    "\n",
    "num_flagged_cases = len(flagged_cases)\n",
    "num_true_positives = len(flagged_cases[flagged_cases['label'] == 1])\n",
    "num_false_positives = len(flagged_cases[flagged_cases['label'] == 0])\n",
    "num_all_positives = len(test_case.loc[test_case['label'] == 1])\n",
    "\n",
    "print(\"When threshold is set to: {}\".format(threshold))\n",
    "print(\"Total of {} flagged cases\".format(num_flagged_cases))\n",
    "print(\"Total of {} flagged cases are true positives\".format(num_true_positives))\n",
    "print(\"True Positive Rate: {}\".format(num_true_positives/float(num_flagged_cases)))\n",
    "print(\"Recall: {}\".format(num_true_positives/float(num_all_positives)))\n",
    "print(\"Precision: {}\".format(num_true_positives/float(num_flagged_cases)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Epilogue\n",
    "----\n",
    "\n",
    "In this notebook, we have showed how to configure the basic training, deployment, and usage of the Amazon SageMaker IP Insights algorithm. All SageMaker algorithms come with support for two additional services that make optimizing and using the algorithm that much easier: Automatic Model Tuning and Batch Transform service. \n",
    "\n",
    "\n",
    "### Amazon SageMaker Automatic Model Tuning\n",
    "The results above were based on using the default hyperparameters of the SageMaker IP Insights algorithm. If we wanted to improve the model's performance even more, we can use [Amazon SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) to automate the process of finding the hyperparameters. \n",
    "\n",
    "#### Validation Dataset\n",
    "Previously, we separated our dataset into a training and test set to validate the performance of a single IP Insights model. However, when we do model tuning, we train many IP Insights models in parallel. If we were to use the same test dataset to select the best model, we bias our model selection such that we don't know if we selected the best model in general, or just the best model for that particular dateaset. \n",
    "\n",
    "Therefore, we need to separate our test set into a validation dataset and a test dataset. The validation dataset is used for model selection. Then once we pick the model with the best performance, we evaluate it the winner on a test set just as before. \n",
    "\n",
    "#### Validation Metrics\n",
    "For SageMaker Automatic Model Tuning to work, we need an objective metric which determines the performance of the model we want to optimize. Because SageMaker IP Insights is an usupervised algorithm, we do not have a clearly defined metric for performance (such as percentage of fraudulent events discovered). \n",
    "\n",
    "We allow the user to provide a validation set of sample data (same format as training data bove) through the `validation` channel. We then fix the negative sampling strategy to use `random_negative_sampling_rate=1` and `shuffled_negative_sampling_rate=0` and generate a validation dataset by assigning corresponding labels to the real and simulated data. We then calculate the model's `descriminator_auc` metric. We do this by taking the model's predicted labels and the 'true' simulated labels and compute the Area Under ROC Curve (AUC) on the model's performance.\n",
    "\n",
    "We set up the `HyperParameterTuner` to maximize the `discriminator_auc` on the validation dataset. We also need to set the search space for the hyperparameters. We give recommended ranges for the hyperparmaeters in the [Amazon SageMaker IP Insights (Hyperparameters)](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights-hyperparameters.html) documentation. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df['timestamp'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The test set we constructed above spans 3 days. We reserve the first day as the validation set and the subsequent two days for the test set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "time_partition = datetime(2018, 11, 13)\n",
    "\n",
    "# If timestamps have been noramlized with time zone info\n",
    "# time_partition = datetime(2018, 11, 13, tzinfo=pytz.FixedOffset(0))\n",
    "\n",
    "validation_df = test_df[test_df['timestamp'] < time_partition]\n",
    "test_df = test_df[test_df['timestamp'] >= time_partition]\n",
    "\n",
    "valid_data = validation_df.to_csv(index=False, header=False, columns=['user', 'ip_address'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then upload the validation data to S3 and specify it as the validation channel. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Upload data to S3 key\n",
    "validation_data_file = 'valid.csv'\n",
    "key = os.path.join(prefix, 'validation', validation_data_file)\n",
    "boto3.resource('s3').Bucket(bucket).Object(key).put(Body=valid_data)\n",
    "s3_valid_data = 's3://{}/{}'.format(bucket, key)\n",
    "\n",
    "print('Validation data has been uploaded to: {}'.format(s3_valid_data))\n",
    "\n",
    "# Configure SageMaker IP Insights Input Channels\n",
    "input_data = {\n",
    "    'train': s3_train_data,\n",
    "    'validation': s3_valid_data\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.tuner import HyperparameterTuner, IntegerParameter\n",
    "\n",
    "# Configure HyperparameterTuner\n",
    "ip_insights_tuner = HyperparameterTuner(\n",
    "    estimator=ip_insights,  # previously-configured Estimator object\n",
    "    objective_metric_name='validation:discriminator_auc',\n",
    "    hyperparameter_ranges={'vector_dim': IntegerParameter(64, 1024)},\n",
    "    max_jobs=4,\n",
    "    max_parallel_jobs=2)\n",
    "\n",
    "# Start hyperparameter tuning job\n",
    "ip_insights_tuner.fit(input_data, include_cls_metadata=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wait for all the jobs to finish\n",
    "ip_insights_tuner.wait()\n",
    "\n",
    "# Visualize training job results\n",
    "ip_insights_tuner.analytics().dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Deploy best model\n",
    "tuned_predictor = ip_insights_tuner.deploy(\n",
    "    initial_instance_count=1, \n",
    "    instance_type='ml.m4.xlarge',\n",
    "    content_type='text/csv',\n",
    "    serializer=csv_serializer,\n",
    "    accept='application/json',\n",
    "    deserializer=json_deserializer\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make a prediction against the SageMaker endpoint\n",
    "tuned_predictor.predict(inference_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should have the best performing model from the training job! Now we can determine thresholds and make predictions just like we did with the inference endpoint [above](#Inference)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch Transform\n",
    "Let's say we want to score all of the login events at the end of the day and aggregate flagged cases for investigators to look at in the morning. If we store the daily login events in S3, we can use IP Insights with [Amazon SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html) to run inference and store the IP Insights scores back in S3 for future analysis.\n",
    "\n",
    "Below, we take the training job from before and evaluate it on the validation data we put in S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer = ip_insights.transformer(\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m4.xlarge',\n",
    ")\n",
    "\n",
    "transformer.transform(\n",
    "    s3_valid_data,\n",
    "    content_type='text/csv',\n",
    "    split_type='Line'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wait for Transform Job to finish\n",
    "transformer.wait()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Batch Transform output is at: {}\".format(transformer.output_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stop and Delete the Endpoint\n",
    "If you are done with this model, then we should delete the endpoint before we close the notebook. Or else you will continue to pay for the endpoint while it is running. \n",
    "\n",
    "To do so execute the cell below. Alternately, you can navigate to the \"Endpoints\" tab in the SageMaker console, select the endpoint with the name stored in the variable endpoint_name, and select \"Delete\" from the \"Actions\" dropdown menu."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ip_insights_tuner.delete_endpoint()\n",
    "sagemaker.Session().delete_endpoint(predictor.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
